import sqlite3
import warnings
import pandas as pd
import plotly.express as px
from scipy.stats import chi2_contingency, chi2, kruskal, spearmanr
from statsmodels.stats.proportion import proportion_confint
from utils.functions import *
The main objectives for this part of the notebook are:
At the end of this section, "Key Takeaways" provides a condensed overview of the main findings and observations. The goal is to distil the most important conclusions, setting the stage for the bivariate data analysis.
The database consists of three tables: "podcasts", "reviews" and "ratings". They are connected by the "podcast_id", which is used to identify individual podcasts. In this section, we will look at the tables individually.
con = sqlite3.connect("database.sqlite")
query = """
SELECT *
FROM podcasts
"""
pd.read_sql_query(query, con).head()
| podcast_id | itunes_id | slug | itunes_url | title | |
|---|---|---|---|---|---|
| 0 | a00018b54eb342567c94dacfb2a3e504 | 1313466221 | scaling-global | https://podcasts.apple.com/us/podcast/scaling-... | Scaling Global |
| 1 | a00043d34e734b09246d17dc5d56f63c | 158973461 | cornerstone-baptist-church-of-orlando | https://podcasts.apple.com/us/podcast/cornerst... | Cornerstone Baptist Church of Orlando |
| 2 | a0004b1ef445af9dc84dad1e7821b1e3 | 139076942 | mystery-dancing-in-the-dark | https://podcasts.apple.com/us/podcast/mystery-... | Mystery: Dancing in the Dark |
| 3 | a00071f9aaae9ac725c3a586701abf4d | 1332508972 | kts-money-matters | https://podcasts.apple.com/us/podcast/kts-mone... | KTs Money Matters |
| 4 | a000a500f06555f81220c3eb641aded7 | 1544900779 | word-on-the-street-w-dreak-swift | https://podcasts.apple.com/us/podcast/word-on-... | Word on the Street w/ Dreak Swift |
How many podcasts are in the table?
query = """
SELECT count(*)
FROM podcasts
"""
con.execute(query).fetchall()[0][0]
110024
query = """
SELECT
COUNT(CASE WHEN podcast_id IS NULL THEN 1 END) AS podcast_id_null_count,
COUNT(CASE WHEN itunes_id IS NULL THEN 1 END) AS itunes_id_null_count,
COUNT(CASE WHEN slug IS NULL THEN 1 END) AS slug_null_count,
COUNT(CASE WHEN itunes_url IS NULL THEN 1 END) AS itunes_url_null_count,
COUNT(CASE WHEN title IS NULL THEN 1 END) AS title_null_count
FROM podcasts;
"""
pd.read_sql_query(query, con)
| podcast_id_null_count | itunes_id_null_count | slug_null_count | itunes_url_null_count | title_null_count | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
query = """
SELECT DISTINCT podcast_id
FROM podcasts
GROUP BY podcast_id
HAVING COUNT(*) > 1;
"""
con.execute(query).fetchall()
[]
Maybe some podcasts have the same titles?
query = """
SELECT COUNT(*)
FROM (
SELECT title
FROM podcasts
GROUP BY title
HAVING COUNT(*) > 1
)
"""
con.execute(query).fetchall()[0][0]
675
Next, I will explore the variables that will be used to analyze the podcast market.
query = """
SELECT *
FROM reviews
"""
all_reviews = pd.read_sql_query(query, con)
all_reviews.head()
| podcast_id | title | content | rating | author_id | created_at | |
|---|---|---|---|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | really interesting! | Thanks for providing these insights. Really e... | 5 | F7E5A318989779D | 2018-04-24T12:05:16-07:00 |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | Must listen for anyone interested in the arts!!! | Super excited to see this podcast grow. So man... | 5 | F6BF5472689BD12 | 2018-05-09T18:14:32-07:00 |
| 2 | ad4f2bf69c72b8db75978423c25f379e | nauseatingly left | I'm a liberal myself, but its pretty obvious a... | 1 | 1AB95B8E6E1309E | 2019-06-11T14:53:39-07:00 |
| 3 | ad4f2bf69c72b8db75978423c25f379e | Diverse stories | I find Tedx talks very inspirational but I oft... | 5 | 11BB760AA5DEBD1 | 2018-05-31T13:08:09-07:00 |
| 4 | ad4f2bf69c72b8db75978423c25f379e | 👍👍👍👍 | I love this podcast, it is so good. | 5 | D86032C8E57D15A | 2019-06-19T13:56:05-07:00 |
Let's see the number of unique podcasts in this table:
query = """
SELECT COUNT(DISTINCT podcast_id)
FROM reviews
"""
con.execute(query).fetchall()[0][0]
111544
How many podcasts (podcast_id) between the tables are matching?
query = """
SELECT COUNT(DISTINCT r.podcast_id)
FROM reviews r
JOIN podcasts p
ON r.podcast_id = p.podcast_id
"""
con.execute(query).fetchall()[0][0]
110024
query = """
SELECT
COUNT(CASE WHEN podcast_id IS NULL THEN 1 END) AS podcast_id_null_count,
COUNT(CASE WHEN title IS NULL THEN 1 END) AS title_null_count,
COUNT(CASE WHEN content IS NULL THEN 1 END) AS content_null_count,
COUNT(CASE WHEN rating IS NULL THEN 1 END) AS rating_null_count,
COUNT(CASE WHEN author_id IS NULL THEN 1 END) AS author_id_null_count,
COUNT(CASE WHEN created_at IS NULL THEN 1 END) AS created_at_null_count
FROM reviews;
"""
pd.read_sql_query(query, con)
| podcast_id_null_count | title_null_count | content_null_count | rating_null_count | author_id_null_count | created_at_null_count | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Review counts per podcast (duplicate podcast_id entries):
query = """
SELECT podcast_id, COUNT(*) as number_of_reviews
FROM reviews
GROUP BY podcast_id
ORDER BY COUNT(*) DESC
"""
pd.read_sql_query(query, con).head()
| podcast_id | number_of_reviews | |
|---|---|---|
| 0 | bf5bf76d5b6ffbf9a31bba4480383b7f | 33104 |
| 1 | bc5ddad3898e0973eb541577d1df8004 | 10675 |
| 2 | bad6c91efdbee814db985c7a65199604 | 9698 |
| 3 | f5fce0325ac6a4bf5e191d6608b95797 | 8248 |
| 4 | f2377a9b0d9a2e0fb05c3dad55759328 | 7389 |
What is this podcast?
query = """
SELECT DISTINCT p.title, c.category, p.itunes_url
FROM podcasts p
JOIN reviews r
ON p.podcast_id = r.podcast_id
JOIN categories c
ON p.podcast_id = c.podcast_id
WHERE r.podcast_id = 'bf5bf76d5b6ffbf9a31bba4480383b7f'
"""
pd.read_sql_query(query, con)
| title | category | itunes_url | |
|---|---|---|---|
| 0 | Crime Junkie | true-crime | https://podcasts.apple.com/us/podcast/crime-ju... |
Some people may try to artificially inflate their review count to appear more popular or manipulate the podcast recommendation system. Setting a standard for when a review is considered suspicious is difficult, as it's natural for some reviews to contain the same information. For example, short reviews like "good job" or "great podcast" are common. I will try a few different approaches based on author IDs, content, and the title of the review.
First, let's examine the number of reviews that might be falsified by looking at instances where a single user leaves more than three reviews with identical content. I chose a threshold of 3 because people are unlikely to write the exact same text that many times; I did not set the limit at 2 because a frequent listener might legitimately repeat a compliment.
query = """
SELECT SUM(counts)
FROM (
SELECT COUNT(*) counts
FROM reviews
GROUP BY author_id, content
HAVING COUNT(*) > 3
)
"""
con.execute(query).fetchall()[0][0]
2621
Let's run the same test, but this time focusing on duplicate titles instead of review content.
query = """
SELECT SUM(counts)
FROM (
SELECT COUNT(*) counts
FROM reviews
GROUP BY author_id, title
HAVING COUNT(*) > 3
)
"""
con.execute(query).fetchall()[0][0]
21300
How many authors have left these reviews?
query = """
SELECT COUNT(author_id)
FROM (
SELECT DISTINCT author_id
FROM reviews
GROUP BY author_id, title
HAVING COUNT(*) > 3
)
"""
con.execute(query).fetchall()[0][0]
2141
People are probably more sophisticated and would use multiple accounts to leave fake reviews. Let's try a different approach based on identical review content for individual podcasts. Since we are now examining the entire lifespan of the podcast, there's a higher chance of encountering the same reviews. I will look for reviews that appear more than 10 times.
query = """
SELECT SUM(counts)
FROM (
SELECT COUNT(*) as counts
FROM reviews
GROUP BY content, podcast_id
HAVING COUNT(*) > 10
)
"""
con.execute(query).fetchall()[0][0]
524
Previously, using the review title helped identify more suspicious reviews. Let's run the same test focusing on titles this time.
query = """
SELECT SUM(counts)
FROM (
SELECT COUNT(*) as counts
FROM reviews
GROUP BY title, podcast_id
HAVING COUNT(*) > 10
)
"""
con.execute(query).fetchall()[0][0]
40789
Now let's try to identify individual podcasts that have many possibly fraudulent reviews. I will use the title, since the content has very few duplications, and raise the threshold to 15 identical review titles per podcast.
query = """
SELECT r.podcast_id, rc.number_of_reviews, COUNT(*) AS num_of_suspicious_titles, SUM(r.counts) AS num_of_duplicate_reviews
FROM (
SELECT podcast_id, title, COUNT(*) AS counts
FROM reviews
GROUP BY title, podcast_id
HAVING COUNT(*) > 15
) AS r
JOIN (
SELECT podcast_id, COUNT(*) AS number_of_reviews
FROM reviews
GROUP BY podcast_id
) AS rc
ON r.podcast_id = rc.podcast_id
GROUP BY r.podcast_id, rc.number_of_reviews
ORDER BY num_of_suspicious_titles DESC
"""
pd.read_sql_query(query, con).head()
| podcast_id | number_of_reviews | num_of_suspicious_titles | num_of_duplicate_reviews | |
|---|---|---|---|---|
| 0 | bf5bf76d5b6ffbf9a31bba4480383b7f | 33104 | 180 | 9215 |
| 1 | bad6c91efdbee814db985c7a65199604 | 9698 | 46 | 2013 |
| 2 | bc5ddad3898e0973eb541577d1df8004 | 10675 | 31 | 1052 |
| 3 | c8bde52bde033bac86f0892998f7c062 | 5099 | 27 | 969 |
| 4 | f5fce0325ac6a4bf5e191d6608b95797 | 8248 | 26 | 666 |
What is this podcast?
query = """
SELECT DISTINCT p.title, c.category, p.itunes_url
FROM podcasts p
JOIN reviews r
ON p.podcast_id = r.podcast_id
JOIN categories c
ON p.podcast_id = c.podcast_id
WHERE r.podcast_id = 'bad6c91efdbee814db985c7a65199604'
"""
pd.read_sql_query(query, con)
| title | category | itunes_url | |
|---|---|---|---|
| 0 | Wow in the World | education | https://podcasts.apple.com/us/podcast/wow-in-t... |
| 1 | Wow in the World | kids-family | https://podcasts.apple.com/us/podcast/wow-in-t... |
| 2 | Wow in the World | kids-family-stories-for-kids | https://podcasts.apple.com/us/podcast/wow-in-t... |
The difficulty mainly comes from deciding when a review counts as suspicious. In this quick look, I used thresholds that are essentially guesses and could be set higher or lower. Building a better system for spotting questionable reviews or podcasts would require much more work.
The key points are:
1) This approach may not catch every fraudulent review, but I will go with the straightforward and less contentious method of removing users who have posted the same review title more than three times.
2) Exclude two particular podcasts that show a notably higher number of duplicated reviews than the others.
By excluding podcasts with minimal review counts, we should get a clearer and more representative analysis. Low-review podcasts might suggest creators with limited commitment or only a few uploads, which could distort insights and misrepresent genuine efforts. I'll remove all podcasts with three or fewer reviews, though this threshold could be set higher.
cursor = con.cursor()
cursor.execute('''CREATE VIEW IF NOT EXISTS FilteredReviews AS
SELECT podcast_id
FROM reviews
WHERE podcast_id != 'bf5bf76d5b6ffbf9a31bba4480383b7f'
AND podcast_id != 'bad6c91efdbee814db985c7a65199604'
AND author_id NOT IN (
SELECT DISTINCT author_id
FROM reviews
GROUP BY author_id, title
HAVING COUNT(*) > 3
)
GROUP BY podcast_id
HAVING COUNT(*) > 3''')
con.commit()
query = """
SELECT COUNT(*)
FROM FilteredReviews
"""
con.execute(query).fetchall()[0][0]
46911
I will continue data cleaning without these podcasts and reviews.
query = """
SELECT rating
FROM reviews
WHERE podcast_id IN (
SELECT podcast_id
FROM reviews
WHERE podcast_id != 'bf5bf76d5b6ffbf9a31bba4480383b7f'
AND podcast_id != 'bad6c91efdbee814db985c7a65199604'
GROUP BY podcast_id
HAVING COUNT(*) > 3
)
AND author_id NOT IN (
SELECT DISTINCT author_id
FROM reviews
GROUP BY author_id, title
HAVING COUNT(*) > 3
)
"""
df = pd.read_sql_query(query, con)
plot_hist(df)
To visualize the overall trend, I will include podcasts with fewer than three reviews.
query = """
SELECT strftime('%Y-%m-%d', created_at, 'weekday 0', '-6 days') as review_week, COUNT(*) as num_reviews
FROM reviews
WHERE podcast_id != 'bf5bf76d5b6ffbf9a31bba4480383b7f'
AND podcast_id != 'bad6c91efdbee814db985c7a65199604'
AND author_id NOT IN (
SELECT DISTINCT author_id
FROM reviews
GROUP BY author_id, title
HAVING COUNT(*) > 3
)
GROUP BY review_week
ORDER BY review_week
"""
df = pd.read_sql_query(query, con)
plot_line(df)
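The `strftime(..., 'weekday 0', '-6 days')` idiom in the query above is worth unpacking: `'weekday 0'` advances a date to the next Sunday (or keeps it if it already is one), and `'-6 days'` then steps back to the Monday that opens the same week, so every timestamp in a Mon-Sun week collapses to one label. A minimal sketch against an in-memory database (the sample date is invented):

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
# 'weekday 0' jumps forward to Sunday; '-6 days' steps back to that week's Monday
row = con_demo.execute(
    "SELECT strftime('%Y-%m-%d', '2018-04-24 12:05:16', 'weekday 0', '-6 days')"
).fetchone()
print(row[0])  # 2018-04-24 is a Tuesday, so the week label is Monday 2018-04-23
```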
This table contains the main variable of interest (target feature) for this analysis.
query = """
SELECT *
FROM categories
"""
pd.read_sql_query(query, con).head()
| podcast_id | category | |
|---|---|---|
| 0 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts |
| 1 | c61aa81c9b929a66f0c1db6cbe5d8548 | arts-performing-arts |
| 2 | c61aa81c9b929a66f0c1db6cbe5d8548 | music |
| 3 | ad4f2bf69c72b8db75978423c25f379e | arts |
| 4 | ad4f2bf69c72b8db75978423c25f379e | arts-design |
Number of podcasts in "categories" table:
query = """
SELECT COUNT(DISTINCT podcast_id)
FROM categories
"""
con.execute(query).fetchall()[0][0]
110024
query = """
SELECT COUNT(DISTINCT c.podcast_id)
FROM categories c
WHERE c.podcast_id IN (SELECT podcast_id FROM FilteredReviews)
"""
con.execute(query).fetchall()[0][0]
46240
query = """
SELECT COUNT(DISTINCT category)
FROM categories
"""
con.execute(query).fetchall()[0][0]
110
Does removing very small podcasts and review anomalies eliminate any categories?
query = """
SELECT COUNT(DISTINCT category)
FROM categories c
WHERE c.podcast_id IN (
SELECT podcast_id FROM FilteredReviews
)
"""
con.execute(query).fetchall()[0][0]
110
How many podcasts are listed under more than one category?
query = """
SELECT COUNT(*)
FROM (
SELECT c.podcast_id
FROM categories c
WHERE c.podcast_id IN (
SELECT podcast_id FROM FilteredReviews
)
GROUP BY c.podcast_id
HAVING COUNT(*) > 1
)
"""
con.execute(query).fetchall()[0][0]
29592
Because of this duplication, there is no clear way to determine a podcast's exact category, but it is still important to know what this variable contains. Since there are too many categories to fit in a single plot, I will limit the plot to categories with more than 800 podcasts.
query = """
SELECT category, COUNT(*) AS counts
FROM categories c
WHERE c.podcast_id IN (
SELECT podcast_id FROM FilteredReviews
)
GROUP BY category
HAVING counts > 800
ORDER BY COUNT(*) DESC
"""
df = pd.read_sql_query(query, con)
plot_counts(df, "Distribution Of Podcast Categories (With Duplication)")
Perhaps there is a pattern that could lead to a straightforward solution to the duplication issue. Let's see which categories have the most changes.
query = """
SELECT cc.category, COUNT(*) AS counts
FROM categories cc
WHERE cc.podcast_id IN (
SELECT podcast_id FROM FilteredReviews
)
GROUP BY cc.category
HAVING COUNT(*) > 800
ORDER BY counts DESC
"""
df = pd.read_sql_query(query, con)
plot_counts(df, "Instances When Podcasts Have Converted Into Or Out Of Category")
To tackle this duplication issue, I'm using a "Majority Category Determination" strategy. For each podcast that has multiple category entries, I'll pick the category that appears most often in the whole dataset. This should allow us to analyse the broad trends. There is also "Minority Category Determination", which would identify the least frequent category associated with each podcast. That method can provide insight into niche or specialized content areas that may be overlooked in a majority-focused analysis. Ultimately, both methods should be used and their results compared to give a more balanced view of the podcast market. For this project, I'll focus on a single method (Majority Category Determination).
For clarity, the plot will contain only categories with more than 300 podcasts.
query = """
WITH RankedCategories AS (
SELECT
category,
ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS ordering_number
FROM categories c
WHERE c.podcast_id IN (
SELECT podcast_id FROM FilteredReviews
)
GROUP BY category
)
SELECT c.podcast_id, c.category, rc.ordering_number
FROM categories c
JOIN RankedCategories rc
ON rc.category = c.category
WHERE c.podcast_id IN (
SELECT DISTINCT r.podcast_id FROM FilteredReviews r
)
"""
df = pd.read_sql_query(query, con)
df_fixed_categories = df.loc[df.groupby("podcast_id")["ordering_number"].idxmin()]
df_to_plot = df_fixed_categories["category"].value_counts()
df_to_plot_filtered = df_to_plot[df_to_plot >= 300]
plot_counts_series(df_to_plot_filtered)
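The `idxmin` step above is the core of the majority-category pick: within each podcast, it returns the row label of the best-ranked (i.e. most popular overall) category. A toy sketch with invented podcast IDs and ranks:

```python
import pandas as pd

# invented example: each podcast is tagged with two categories, ranked by overall popularity
toy = pd.DataFrame({
    "podcast_id": ["p1", "p1", "p2", "p2"],
    "category": ["arts", "music", "news", "arts"],
    "ordering_number": [3, 1, 2, 3],
})
# per podcast, keep the row whose ordering_number is smallest (the most popular category)
picked = toy.loc[toy.groupby("podcast_id")["ordering_number"].idxmin()]
print(picked["category"].tolist())  # ['music', 'news']
```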
keep_from_fixed_categories = ["podcast_id", "category"]
keep_from_all_reviews = ["podcast_id", "rating", "author_id", "created_at"]
df = pd.merge(
df_fixed_categories[keep_from_fixed_categories],
all_reviews[keep_from_all_reviews],
on="podcast_id",
)
df.head()
| podcast_id | category | rating | author_id | created_at | |
|---|---|---|---|---|---|
| 0 | a000aa69852b276565c4f5eb9cdd999b | arts | 5 | F43E015ADAF7828 | 2018-05-12T09:31:51-07:00 |
| 1 | a000aa69852b276565c4f5eb9cdd999b | arts | 5 | D3988F2DCD317FB | 2018-04-06T04:14:20-07:00 |
| 2 | a000aa69852b276565c4f5eb9cdd999b | arts | 5 | 923E47C12ACD3D5 | 2018-05-03T19:47:17-07:00 |
| 3 | a000aa69852b276565c4f5eb9cdd999b | arts | 5 | 5A71004E5146D3C | 2018-02-06T15:39:23-07:00 |
| 4 | a000aa69852b276565c4f5eb9cdd999b | arts | 5 | F0B119E5348C41C | 2018-01-31T21:09:40-07:00 |
We have previously looked into the number of reviews per podcast, but now there is a new metric - the number of reviews per category. If this number for a category is low, the insights could be unreliable and possibly wrong.
Let's look into a traditional method for detecting these outliers using IQR:
plot_box(df)
df_below_1200 = df["category"].value_counts()[lambda x: x < 1200]
filtered_categories = df_below_1200.index
df = df[~df["category"].isin(filtered_categories)]
len(df_below_1200)
54
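For comparison, the textbook IQR fence can be computed directly; on counts this skewed, the lower fence typically goes negative, which is why a manual cutoff (here 1200) ends up doing the real work. A sketch on invented per-category counts (the notebook's real input would be `df["category"].value_counts()`):

```python
import pandas as pd

# invented per-category review counts, heavily right-skewed like the real data
counts = pd.Series([40000, 9000, 5000, 3000, 1500, 900, 400, 120, 60, 10])
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # anything below this would be a low-side IQR outlier
print(lower_fence)  # negative, so the IQR rule flags no small categories at all
```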
The following issues with the data were identified, and steps taken to address them:
After the cleaning process, the analysis will be based on 47,467 podcasts.
The main objectives for this part of the notebook are:
To understand market situation, information from different feature relations will have to be combined and interpreted. My initial plans are:
1) Look at ratings for each category; maybe there is something useful.
2) Find the categories with the smallest decrease in the number of reviews - the most important factor, since the platform is declining.
3) Discover whether there are categories where reviewers engage with content from several different podcasts within the same genre, to see if they would be open to new content.
4) Compare the number of podcasts to the number of reviews across categories, to see if there is higher demand for a comparatively smaller supply.
In this section, we explore how podcast categories relate to ordinal ratings. Ordinal data can be treated as numeric or categorical, so I'll apply both chi-square and Kruskal-Wallis tests. By using multiple tests, my aim is to validate the findings. Using statistical tests helps understand whether there are differences in ratings across categories and if they are statistically meaningful.
After conducting these tests, I'll visualize the relationship as categorical-categorical using proportional bar plots, providing an interpretable view of the results.
The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It compares the observed frequencies with the frequencies that would be expected, if there was no association. If the calculated chi-square statistic exceeds a critical value from the table based on degrees of freedom and a chosen significance level, which for this project will be 0.05, it suggests a relationship between variables.
Additionally, we calculate a p-value to quantify the strength of the evidence. If the p-value is smaller than the chosen significance level (in this case, 0.05), we reject the null hypothesis in favor of the alternative hypothesis, indicating a significant association between the variables. The null and alternative hypotheses have the same structure for all the statistical tests in this section.
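As a quick illustration of this decision rule, here is a toy chi-square test on an invented 2x3 table of rating counts for two hypothetical categories:

```python
from scipy.stats import chi2_contingency

# invented counts: rows are two categories, columns are three rating buckets
observed = [[90, 5, 5],
            [70, 15, 15]]
chi, p, dof, expected = chi2_contingency(observed)
alpha = 0.05
# reject the null hypothesis of "no association" only when p < alpha
print(p < alpha)  # True
```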
alpha = 0.05
contingency_table = pd.crosstab(df["category"], df["rating"])
chi, p_value, degrees_of_freedom, expected_freq = chi2_contingency(contingency_table)
critical_value = chi2.ppf(1 - alpha, degrees_of_freedom)
print(f"Chi-square statistic: {round(chi, 2)}")
print(f"p-value: {p_value}")
print(f"Critical Value: {round(critical_value, 2)}")
Chi-square statistic: 63779.67
p-value: 0.0
Critical Value: 246.97
The Kruskal-Wallis test is a statistical test used to compare the medians of three or more independent groups. The H-statistic in the Kruskal-Wallis test measures the variability in the medians across all the different groups or categories. A larger H-statistic indicates greater variability or differences in the medians of the groups.
I chose this test because, as we've seen from the ratings distribution, our data is heavily skewed, and using the median instead of the mean is likely a better option.
grouped_data = [group["rating"] for name, group in df.groupby("category")]
h_stat, p_value = kruskal(*grouped_data)
print("H-statistic:", round(h_stat, 2))
print("p-value:", p_value)
H-statistic: 58241.25
p-value: 0.0
Since the vast majority of reviews are "5", I consider anything other than "5" to be a negative review.
rating_proportions = pd.crosstab(df["category"], df["rating"], normalize="index")
sorted_proportions = rating_proportions.sort_values(by=5, ascending=False).reset_index()
plot_ratings_categories(sorted_proportions)
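The `normalize="index"` argument is what turns raw counts into per-category proportions: each row of the crosstab sums to 1, making categories of very different sizes comparable. A minimal sketch on invented ratings:

```python
import pandas as pd

# invented ratings for two categories
toy = pd.DataFrame({
    "category": ["arts", "arts", "arts", "news", "news"],
    "rating": [5, 5, 1, 5, 1],
})
props = pd.crosstab(toy["category"], toy["rating"], normalize="index")
print(props.loc["arts", 5])  # 2 of 3 'arts' reviews are rated 5
```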
Let's check if my idea about niche podcasts having higher ratings solely due to smaller sample size is correct. I will calculate 95% confidence intervals for the proportions of ratings with a value of 5 for the five most highly rated podcasts. For comparison, I'll do the same for the five podcasts with the worst ratings as well.
first_5 = sorted_proportions["category"][:5].tolist()
last_5 = sorted_proportions["category"][-5:].tolist()
names = first_5 + last_5
for name in names:
category_df = df[df["category"] == name]
total_count = len(category_df)
num_ratings_5 = len(category_df[category_df["rating"] == 5])
proportion_5 = num_ratings_5 / total_count
lower_bound, upper_bound = proportion_confint(
num_ratings_5, total_count, alpha=0.05
)
if name == 'fiction':
print('-------------')
print(f"{name}: ({round(lower_bound, 3)}, {round(upper_bound, 3)})")
education-self-improvement: (0.979, 0.99)
business-entrepreneurship: (0.974, 0.981)
sports-fantasy-sports: (0.959, 0.977)
business-management: (0.957, 0.977)
business-marketing: (0.96, 0.972)
-------------
fiction: (0.803, 0.816)
government: (0.791, 0.817)
news-politics: (0.722, 0.762)
news: (0.713, 0.718)
true-crime: (0.705, 0.714)
I have already looked at the plot showing the change in the number of reviews from 2017 and selected the most interesting cases. These include all cases (of which there are few) where the number of reviews is not falling, and two additional lines for "education" and "health-fitness" to comment on the findings. These two lines also represent the decline that iTunes is experiencing.
warnings.filterwarnings("ignore", category=UserWarning, module=".*")
df_plot = df[["category", "created_at"]].copy()
categories_to_keep = [
"education",
"kids-family",
"true-crime",
"education-self-improvement",
"health-fitness",
"health-fitness-alternative-health",
"health-fitness-fitness",
"science",
"society-culture-personal-journals",
]
df_plot = df_plot[df_plot["category"].isin(categories_to_keep)]
df_plot["created_at"] = pd.to_datetime(df_plot["created_at"], format="mixed")
df_plot = df_plot[df_plot["created_at"].dt.year >= 2015]
df_plot = df_plot[df_plot["created_at"].dt.year < 2023]
df_plot["year_month"] = df_plot["created_at"].dt.to_period("M").astype(str)
monthly_reviews = (
df_plot.groupby(["category", "year_month"]).size().reset_index(name="num_reviews")
)
plot_reviews_month(monthly_reviews)
Bad options:
A possibility:
Good options:
True crime has come up three times in this analysis: once when we removed the outlier, again when we looked at the lowest-rated categories, and once more among the categories that are not yet losing listeners. Let's see whether the negative reviews are a recent phenomenon.
df_true_crime = df[df["category"] == "true-crime"]
df_true_crime = df_true_crime[["rating", "created_at"]].copy()
df_true_crime["created_at"] = pd.to_datetime(df_true_crime["created_at"])
df_true_crime["year_month"] = df_true_crime["created_at"].dt.to_period("M").astype(str)
df_rating_counts = (
df_true_crime.groupby(["year_month", "rating"]).size().reset_index(name="count")
)
df_total_counts = (
df_true_crime.groupby("year_month").size().reset_index(name="total_count")
)
df_merged = pd.merge(df_rating_counts, df_total_counts, on="year_month")
df_merged["proportion"] = df_merged["count"] / df_merged["total_count"]
plot_true_crime_month(df_merged)
Answering this would require more research, but I think it is not. On one hand, growing discontent among fans could present a good opportunity to fix the issues (perhaps by replicating what was done right before). On the other hand, fixing content quality is ambitious and risky.
In a later subsection, I will try to estimate competition and saturation for all genres, including this one.
In this chapter, I will investigate the relationship between podcast categories and user engagement by analyzing the frequency of repeated user IDs within each category. I will explore whether users tend to engage with multiple podcasts within the same category or whether they remain loyal to a single favourite podcast.
Let's use a Kruskal-Wallis test to see if there are significant differences. The interpretation of this test is provided in a sub-section above.
Since this essentially creates a new feature, I will display this new data for clarity.
df_users = df[["podcast_id", "category", "author_id"]]
grouped = (
df_users.groupby(["category", "author_id"])["podcast_id"].nunique().reset_index()
)
grouped = grouped.rename(columns={'podcast_id': 'number_of_reviews'})
grouped.head()
| category | author_id | number_of_reviews | |
|---|---|---|---|
| 0 | arts | 000260281A737A3 | 1 |
| 1 | arts | 0003FE5763C5790 | 1 |
| 2 | arts | 00043DCB0A64EA3 | 1 |
| 3 | arts | 00044218B20DAED | 1 |
| 4 | arts | 00047F7D109332F | 1 |
For example: We can see that the user with the ID '000260281A737A3' left a review on a single podcast in the arts category. The relationship we are interested in is between the "category" and "number_of_reviews".
grouped_data = [group["number_of_reviews"] for name, group in grouped.groupby("category")]
h_stat, p_value = kruskal(*grouped_data)
print(f"H-statistic: {round(h_stat, 2)}")
print(f"p-value: {p_value}")
H-statistic: 15628.15
p-value: 0.0
descriptive_stat = (
grouped.groupby("category")["number_of_reviews"]
.describe()
.sort_values(by="mean", ascending=False)
)
min_mean = round(descriptive_stat["mean"].min(), 2)
max_mean = round(descriptive_stat["mean"].max(), 2)
print(f"The range of means is from: {min_mean} to {max_mean}")
print("All unique medians are:", descriptive_stat["50%"].unique()[0])
The range of means is from: 1.0 to 1.19
All unique medians are: 1.0
Let's take a direct approach to examining supply, represented by the number of podcasts, and demand, indicated by the number of reviews. We will be looking for outliers - points that deviate significantly from the general trend. Specifically, we're interested in points with high review counts but low podcast numbers. This analysis allows us to point out categories with exceptional listener engagement relative to other categories.
There is probably a strong connection between these two variables, but I'll verify it with a statistical test. A correlation test assesses the strength and direction of the relationship between two continuous variables: a coefficient close to 1 signifies a strong positive correlation, close to -1 a strong negative correlation, and around 0 no correlation. Since I'm keeping the outliers with high review counts, I'll use Spearman's correlation, which handles them better.
df_summary = (
df.groupby("category")
.agg(num_podcasts=("podcast_id", "nunique"), total_reviews=("rating", "count"))
.reset_index()
)
spearman_corr, spearman_p_value = spearmanr(
df_summary["num_podcasts"], df_summary["total_reviews"]
)
print(
f"Spearman correlation coefficient: {round(spearman_corr, 2)}, p-value: {round(spearman_p_value, 3)}"
)
Spearman correlation coefficient: 0.88, p-value: 0.0
To declutter the plot, I won't display the names of the categories, but they'll be visible when hovering over the points.
plot_podcasts_reviews(df_summary)
Recommendation: Create a podcast for kids.
In addition to analyzing the largest categories, I should run a separate analysis after resolving the duplicated categories by selecting the least frequent category instead. There may be insights hidden at a more granular level. It's possible that there are no differences between the smaller and the larger encompassing categories, but this would need verification.
Further avenues to explore:
Validate assumptions for each test:
Additional ideas to improve the project: